Data Streams and Applications in Computer Science

Author

David P. Woodruff

Abstract

This is a short survey of my work in data streams and related applications, such as communication complexity, numerical linear algebra, and sparse recovery. The goal is to give a non-technical overview of results in data streams and to highlight connections between these different areas. It is based on my Presburger Award lecture given at ICALP 2014.

1 The Data Stream Model

Informally speaking, a data stream is a sequence of data that is too large to be stored in available memory. The data may be a sequence of numbers, points, edges in a graph, and so on. There are many examples of data streams, such as internet search logs, network traffic, sensor networks, and scientific data streams (for example in astronomy, genomics, and physical simulations). The abundance of data streams has led to new algorithmic paradigms for processing them, which often impose very stringent requirements on the algorithm’s resources.

Formally, in the streaming model, there is a sequence of elements a_1, ..., a_m presented to an algorithm, where each element is drawn from a universe [n] = {1, ..., n}. The algorithm is allowed a single pass or a small number of passes over the stream. In network applications, the algorithm is typically given only a single pass, since if data on a network is not physically stored somewhere, it may be impossible to make a second pass over it. In other applications, such as when data resides on external memory, it may be streamed through main memory a small number of times, each time constituting a pass of the algorithm.

The algorithm would like to compute a function or relation of the data stream. Because of the sheer size of the stream, for many interesting problems the algorithm is necessarily randomized and approximate in order to be efficient. One should note that the randomness is in the coin tosses of the algorithm rather than in the stream. That is, with high probability over its coin tosses, the algorithm should correctly compute the function or relation for any stream that is presented to it. This is more robust than, say, an algorithm that assumes particular orderings of the stream that could make the problem easier to solve.

In this survey we focus on computing or approximating order-independent functions f(a_1, ..., a_m). A function is order-independent if applying any permutation to its inputs results in the same function value. As we will see, this is often the case in numerical applications, for example when one is interested in the number of distinct values in the sequence a_1, ..., a_m.

One of the main goals of a streaming algorithm is to use as little memory as possible in order to compute or approximate the function of interest. The amount of memory used (in bits) is referred to as the space complexity of the algorithm. While it is always possible for the algorithm to store the entire sequence a_1, ..., a_m, this is usually prohibitive in applications. For example, internet routers often have limited resources; asking them to store a massive sequence of network traffic is infeasible.

Another goal of streaming algorithms is a small processing time, i.e., how long it takes to update the memory contents when presented with a new item in the stream. Items in streams are often presented at very high speeds, and the algorithm needs to quickly update the data structures in its memory in order to be ready to process future updates.

For order-independent functions, we can think of the stream as an evolution of an underlying vector x ∈ R^n. That is, x is initialized to the all-zero vector, and when the item i appears in the stream, x undergoes the update
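The vector view of a stream can be illustrated with a minimal Python sketch. The excerpt above is cut off before stating the update rule, so the code assumes the common insertion-only rule x_i ← x_i + 1; that rule, and the helper names `frequency_vector` and `distinct_count`, are assumptions made for illustration and are not taken from the survey. The sketch stores x explicitly, so it uses space linear in n and is not itself a small-space streaming algorithm; it only shows why an order-independent function such as the number of distinct values depends on the stream through x alone.

```python
import random
from typing import Iterable, List


def frequency_vector(stream: Iterable[int], n: int) -> List[int]:
    """Make a single pass over the stream, maintaining the vector x."""
    x = [0] * n  # x is initialized to the all-zero vector
    for i in stream:  # each item i is drawn from the universe [n] = {1, ..., n}
        if not 1 <= i <= n:
            raise ValueError(f"item {i} is outside the universe [1, {n}]")
        x[i - 1] += 1  # ASSUMED update rule (not in the excerpt): x_i <- x_i + 1
    return x


def distinct_count(x: List[int]) -> int:
    """Number of distinct values in the stream = number of nonzero entries of x."""
    return sum(1 for value in x if value != 0)


stream = [3, 1, 3, 5, 1, 3]
x = frequency_vector(stream, n=5)

# Permuting the stream leaves x, and hence any function of x, unchanged.
shuffled = stream[:]
random.Random(0).shuffle(shuffled)
x_shuffled = frequency_vector(shuffled, n=5)
```

Here `x` and `x_shuffled` are the same vector, so `distinct_count` returns the same value for both orderings of the stream.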

Similar resources

Mining Frequent Patterns in Uncertain and Relational Data Streams using the Landmark Windows

Today, in many modern applications, we search for frequent and repeating patterns in the analyzed data sets. In this search, we look for patterns that frequently appear in the data set and mark them as frequent patterns to enable users to make decisions based on these discoveries. Most algorithms presented in the context of data stream mining and frequent pattern detection work either on uncertai...

Data Replication-Based Scheduling in Cloud Computing Environment

Abstract— High-performance computing and vast storage are two key factors required for executing data-intensive applications. In comparison with traditional distributed systems like data grid, cloud computing provides these factors in a more affordable, scalable and elastic platform. Furthermore, accessing data files is critical for performing such applications. Sometimes accessing data becomes...

An Intelligent Computer Interface Utilizing Parallel Picocontrollers (TECHNICAL NOTE)

The design of an interface unit is described, in which RS232 serial data is converted to latched parallel data on 22 independent lines. The data direction of each line is programmable through the serial port. Two picocontrollers are employed in a parallel processing mode to give the required number of I/O pins, and data on the shared serial line is coded to separate data streams to the individu...

E2DR: Energy Efficient Data Replication in Data Grid

Abstract— Data grids are an important branch of grid computing which provide mechanisms for the management of large volumes of distributed data. Energy efficiency has recently emerged as a hot topic in large distributed systems. The development of computing systems is traditionally focused on performance improvements driven by the demand of client applications in scientific and business domai...

Window Queries over Data

An abstract of the dissertation of Jin Li for the Doctor of Philosophy in Computer Science presented October 17, 2008. Title: Window Queries over Data Streams Evaluating queries over data streams has become an appealing way to support various stream-processing applications. Window queries are commonly used in many stream applications. In a window query, certain query operators, especially block...

Scaling Up for High Dimensional Data in Data Stores and Streams

The data in engineering and science has been on a massive scale and stored in gigantic storage devices. The data is moved in and out in the form of data streams. Data storage levels are reaching Yottabytes in terms of storage. Science and engineering transforms such data into rich and resourceful data. Intensive methods have been researched for high dimensionality. Science also uses high speed ...

Journal:

Bulletin of the EATCS

Volume 114

Publication date: 2014
